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ABSTRACT 

This paper presents a new methodology for automatic knowl- 
edge driven data mining based on the theory of Mercer Ker- 
nels, which are highly nonlinear symmetric positive definite 
mappings from the original image space to a very high, 
possibly infinite dimensional feature space. We describe 
a new method called Mixture Density Mercer Kernels to 
learn kernel function directly from data, rather than using 
pre-defined kernels. These data adaptive kernels can en- 
code prior knowledge in the kernel using a Bayesian formu- 
lation, thus allowing for physical information to be encoded 
in the model. We compare the results with existing algo- 
rithms on data from the Sloan Digital Sky Survey (SDSS). 
The code for these experiments has been generated with 
the AutoBayes tool, which automatically generates effi- 
cient and documented C/C++ code from abstract statisti- 
cal model specifications. The core of the system is a schema 
library which contains templates for learning and knowl- 
edge discovery algorithms like (different versions of EM, or 
numeric optimization methods like conjugate gradient meth- 
ods. The template instantiation is supported by symbolic- 
algebraic computations, which allows AutoBaYES to find 
closed-form solutions and, where possible, to integrate them 
into the code. The results show that the Mixture Density 
Mercer Kernel described here outperforms tree-based clas- 
sification in distinguishing high-redshift galaxies from low- 
redshift galaxies by approximately 16% on test data, bagged 
trees by approximately 7%, and bagged trees built on a much 
larger sample of data by approximately 2%. 

Categories and Subject Descriptors 

D.2.m [Software Engineering] : Miscellaneous — Rapid Pro- 
totyping; G. 3 [Probability and Statistics]; H.2.8 [Data- 
base Management]: Database Applications — Data min- 
ing, Scientific databases; 1.1.3 [Symbolic and Algebraic 
Manipulation]: Languages and Systems — Special purpose 
algebraic systems; 1.2.2 [Artificial Intelligence]: Auto- 


matic Programming — Program synthesis ; 1.2.6 [Artificial 
Intelligence], Learning — Parameter learning; 1.5.1 [Pattern 
Recognition]: Models — Statistical; 1.5.3 [Pattern Recog- 
nition]: Clustering — Algorithms; J.2 [Physical Sciences 
and Engineering]: Astronomy 

General Terms 

Bayesian Statistics, Mercer Kernels 

1. INTRODUCTION 

There is a growing interest in the machine learning and data 
mining communities in the field of Mercer Kernels due to 
their mathematical properties as well as their use in Sup- 
port Vector Classifiers and Regressors. The theory of Mer- 
cer Kernels allows data which may be embedded in a vector 
space, such as spectral lines, physical measurements, stock 
market indices, or may not arise from a vector space, such 
as sequences, graphs, and trees to be treated using simi- 
lar mathematics. Work by Haussler [13] shows how to map 
sequences of symbols into a feature space using kernel sim- 
ilarity measures. In the same paper, Haussler introduced 
the idea of a Probabilistic Kernel function, or P-Kernel that 
obeys Mercer’s conditions (i.e., the kernel function must be 
symmetric and positive definite) and is defined as follows: 

K(x„x : ,) = P(x l |©)P(x J |e) (1) 

This kernel function says that two points are similar if they 
are both more likely given the model 0. Thus, data points 
lying in 7 Z n which may be far away from each other in the 
Euclidean sense may turn out to be ’similar’ as measured 
by this kernel. We generalize this notion of similarity using 
Mixture Density Mercer Kernels. 

In a recent paper [18], the notion of Mixture Density Mercer 
Kernels was introduced. The idea is to express the distribu- 
tion function P(x*) in terms of a full Bayesian formulation of 
a density function. The kernel function is created by taking 
bootstrap aggregate samples models based on the distribu- 
tion function. Thus, for one bootstrap sample, we have: 

c 

P(x,|e) = £P(c)P(x,|0 c ) (2) 

C — 1 

Due to the Bayesian formulati on, prior distributions c.-ui be 
placed on the model parameters for each bootstrap sample. 
This allows us to encode domain knowledge into each model. 



The kernel function is then composed of the sum of the outer 
products of the class membership matrices. Thus, we have: 


K(x,,Xj) 
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where K represents the a sum of M bootstrap samples. 

(x, ) is a composite class membership function, where each 
member of the composite is the posterior class distribution 
for a model. Thus, for M models, we have: 


$(xi) oc [Pi{c = l|xi),Pi(c = 2|xi), . . . , 

Pi(c = C|xi),P 2 (c = l|xi), . . . , Pm(c = C|xi)] 

In the hard clustering case, where the posterior class distri- 
bution for a given model is a zero-one vector, the ( i,j ) ele- 
ment of the Mixture Density Mercer Kernel describes how 
many times, on average, the M models agreed that data 
points x, and Xj arose from the same mode in the density 
function. 


• N is the number of data points Xj drawn from a D 
dimensional space 


• M is the number of probabilistic models used in gen- 
erating the kernel function. 


• C is the number of mixture components in each prob- 
abilistic model. In principle one can use a different 
number of mixture components in each model. How- 
ever, here we choose a fixed number for simplicity. 

• Xi is a p x 1 dimensional real column vector that rep- 
resents the data sampled from a data set X. 


• < t'(x) : TV i— > p is generally a nonlinear mapping to 
a high, possibly infinite dimensional, feature space T . 
This mapping operator may be explicitly defined or 
may be implicitly defined via a kernel function. 


As is the case with ensemble methods, the greater the vari- 
ability in these models, the better the performance of the 
overall model. We demonstrate the degree of variability in 
the models due to different initial conditions in terms of 
variations in the converged likelihood value. This paper elu- 
cidates the idea and demonstrates its feasibility in working 
with a large astronomical data set known as the Sloan Dig- 
ital Sky Survey. 

The information required to construct the kernel function 
(i.e., the class membership matrix) can be computed by an 
application of the EM-algorithm [9] on a suitable training 
set. A number of EM-implementations is available (e.g., Au- 
toclass [6, 7], EMMIX [15], MCLUST [11]) and any of them 
could be used. However, in order to insert domain knowl- 
edge into the kernel matrix, the EM-code has to be modified 
accordingly; this is time-consuming and error-prone. More- 
over, since the choice of a particular prior has consequences 
for the quality of the kernel matrix, a certain amount of ex- 
perimentation is necessary. However, the exact form of the 
prior can also have substantial consequences on the details 
of the implementation (e.g., the form of the M-step or the 
internal data structures) which magnifies the implementa- 
tion problem. Fortunately, the overall structure of the algo- 
rithm remains the same and the details can be derived me- 
chanically. Here, we have applied AutoBayes to produce 
the different variants of the EM-algorithm. AutoBayes 
[5, 10, 12] is a fully automatic program synthesis system 
that generates efficient and documented C/C++ code from 
abstract statistical model specifications. The core of the 
system is a schema library which contains templates for 
learning and knowledge discovery algorithms like different 
versions of EM, and numerical optimization methods. The 
template instantiation is supported by symbolic-algebraic 
computations, which allows AutoBayes to find closed-form 
solutions and, where possible, to integrate them into the 
code from the templates. 

2. NOTATION 

• D is the dimension of the data 


• K(xi,Xj) = $(x,)$ r (x j ) e n is the kernel function 
that measures the similarity between data points x, 
and Xj. If K is a Mercer kernel, it can be written as 
the outer product of the map < J> As i and j sweep 
through the N data points, it generates an N x N ker- 
nel matrix. 

• p c is the mixture weight for the c th mixture, and q(i, c) 
is the posterior probability of class membership, i.e., 
q(i, c) = P(c|xi) 

• © = (p, p, <t) is the entire set of parameters that spec- 
ify a mixture model. 

3. MIXTURE DENSITY MERCER KERNELS 

The Mixture Density Mercer Kernel function given in (3) 
is similar to the Cluster-based Similarity Partitioning Algo- 
rithm (CSPA) discussed in [1]. While their implementation 
uses hard clustering, it can be extended to the soft clus- 
tering (expectation-maximization) approach described here. 
An important difference between this work and previous 
work, however, is that we intend to use our kernel function 
in support vector machines for classification and regression. 
The modularity of SVMs allow different kernels to be imple- 
mented that model the underlying data generating process 
in different ways. 

The Mixture Density Mercer Kernel is built using an en- 
semble of Bayesian mixture density models. The Bayesian 
formulation allows for prior information to be encoded in the 
model. Then, rather than computing a maximum-likelihood 
estimator, we compute a maximum a posteriori estimator 
which includes the likelihood function and the prior. The 
greater the heterogeneity of the models used in generat- 
ing the kernel, the more effective the procedure. In the 
AutoBayes implementation of the. procedure, the training 
data is sampled M times with replacement. These overlap- 
ping data sets, combined with random initial conditions for 


the EM algorithm, aid in generating a heterogenous ensem- 
ble. 


form: 


In this work, we assume a Gaussian distribution as the model 
for each class, with priors expressed in terms of a conjugate 
prior for the Gaussian distribution. A conjugate prior is 
defined as a family F of probability density functions such 
that for every / e F, the posterior /(©|x) also belongs to 
F. 

For a mixture of Gaussians model, priors can be set as fol- 
lows [3]. For priors on the means, either a uniform distribu- 
tion or a Gaussian distribution can be used. For priors on 
the covariance matrices, the Wishart density can be used: 
P(Ei|a, ft J) oc exp(— atr(S~ 1 J)/2). For priors on 

the mixture weights, a Dirichlet distribution can be used: 
P(pi I7) oc n? = iPr\ w fr ere P> = P(c = »)• Maximum a 
posteriori estimation is performed by taking the log of the 
posterior likelihood of each data point x* given the model 
0. The following function is thus optimized using the Ex- 
pectation Maximization [8] for a Gaussian mixture model 
with priors on the means only. 

For example, the log posterior probability 


l°g(P(p > x I c,<7 2 ) x P(c | p )) 
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After dropping the terms in the first line (which are indepen- 
dent of the goal variables), and introducing the Lagrange- 
multipher A we get the following final form of the log-likelihood 
function. 
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for a model with conjugate priors on the means is thus com- 
puted as follows. The first step is to marginalize (i.e., sum 
out) the latent variable c via the expectation q. However, 
to keep this step tractable, it is important to delay the ac- 
tual summation as long as possible. We thus introduce the 
“delayed summation” operator Edom e ---9, which gives us 


» -> t — 1 ...N 

* -^dom Cj ~9i 


logP(p,x j c,a 2 ) + 53 logP(c i p) 

* 1 rf dom 


This rather onerous derivation of the likelihood function (in- 
cluding the MjjX-code for the displayed formulae!) was gen- 
erated fully automatically by AutoBayeS. It is important 
to note that although this likelihood function is much more 
complicated than the likelihood function for a model with 
no priors, it does not change the underlying parametric sta- 
tistical description of the data Thus, the slightly increased 
computational burden is only seen at the model building 
stage, but not at the model evaluation stage. 


We can then apply the product rule to decompose the proba- 
bilities and then replace them by the density functions. This 
gives us the formidably looking log-likelihood function 
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Now we can actually execute the delayed summations; the 
crucial step is to replace all occurrences of the latent vari- 
able inside the body of the c*~q by appropriately re- 

indexed' and weighted occurences. For example, this step 
simplifies EE£~ 7 , lo gIl£ P cj into EZi logIl£ 93,. Pc,- 
After further simplifications, we then get the more tractable 


The art of choosing priors is one of much study in Bayesian 
data analysis. As will be seen later in this work, we choose 
priors based on human knowledge about the domain prob- 
lem as well as from various visualizations of the data that 
indicate where modes should be placed. In the problem of 
classifying low and high redshift galaxies, we only include 
priors on the means of the Gaussians, rather than on the 
mixture weights or the covariance matrices. 

4. AUTOBAYES 

AutoBayeS is a fully automatic program synthesis system 
for the data analysis domain. It is implemented in Pro- 
log and comprises about 80,000 lines of documented code. 
From the outside, it looks similar to a compiler: it takes 
an abstract problem specification in the form of a statisti- 
cal model and translates it into executable code. On the 
inside, however, it works quite different: AutoBayeS first 
derives a customized algorithm implementing the model and 
then transforms it into optimized C/C++ code implement- 
ing the algorithm. The algorithm derivation or synthesis 
process — which distinguishes AutoBayeS from traditional 
compilers — relies on_the intricate interplay of three key tech- 
niques. (i) AutoBayeS uses Bayesian networks (BNs) [4, 
IT] as a compact internal representation of the statistical 


model mog as 'Multivariate Mixture of Gaussians'; 

const int D := 5 as 'number of bands' 
const int N as 'number of data points’ 
with 1 < N; 

const int C as 'number of classes' 
with 1 < C; 
with C « N; 

double phi Cl.. C) as 'class probabilities’ 
with 1 = sum(_i := 1..C, phi(_i)); 
double mu(l. .D. 1..C); 
double sigma(l..D, 1..C); 

output int c(l..N) as 'latent variable’; 
c(_) " discrete(phi) ; 

data double x(l. -D. 1..N); 

x(_i,_j) * gauss(mu(_i,c(_j)) , signa(_i,c(_j))) ; 
max pr(x t {phi, mu, sigma}) wrt {phi, mu, sigma}; 

Figure 1: AuToBAYES-specification for Gaussian mix- 
ture model. Keywords are underlined. 

models. BNs provide an efficient encoding of the joint prob- 
ability distribution over all variables and thus enable replac- 
ing expensive probabilistic reasoning by faster graphical rea- 
soning. In particular, they speed up the decomposition of a 
problem into statistically independent simpler subproblems. 
(ii) AutoBayes uses program schemas as the basic build- 
ing oiocss lor the algorithm derivation. Schemas consist of a 
parameterized code fragment or template and a set of con- 
straints which are formulated as conditions on BNs. The 
templates can encapsulate advanced algorithms and data 
structures, which lifts the abstraction level of the algorithm 
derivation. The constraints allow the network structure to 
guide the application of the schemas, which prevents a com- 
binatorial explosion of the search space, (iii) AutoBayes 
contains a specialized symbolic subsystem which can find 
closed-form solutions for many problems and emerging sub- 
problems. The combination of these techniques results in a 
fast synthesis process which compares in speed to the com- 
pilation of the synthesized code. 

Specification Language. A statistical model describes the 
properties of the data in a fully declarative fashion: for each 
problem variable of interest (i.e., observation or parame- 
ter), properties and dependencies are specified via proba- 
bility distributions and constraints. Figure 1 shows how 
the standard Gaussian mixture model with diagonal covari- 
ance matrices can be represented in AutoBayes’s specifi- 
cation language. The model assumes that the data consists 
of N points in D dimensions such that each point belongs 
to one of C classes; the first few lines of the specification 
just declare these symbolic constants and specify the con- 
straints on them. However, instead of drawing each point 
x(l..C,_j) (where .. corresponds to Matlab’s : subrange 
operator, and _i, _j are index variables) from a multivari- 
ate Gaussian c (_j ) with a full Dx D-dimensional covariance 
matrix, each band _i is drawn independently from a univari- 
ate Gaussian with mean mu(_i,c(_j)) and standard devia- 
tion sigma (_i ,c(_j)). The unknown distribution parame- 
ters can be different for each class and each band; hence, we 
declare them -as matrices. The- unknown -assignment of the 
points to the distributions (i.e., classes) is represented by the 
latent variable c; since we are interested in the classification 


results as well (and not only the distribution parameters), 
c is declared as output, c is distributed as a discrete distri- 
bution with the relative class frequencies given by the also 
unknown vector phi. Since each point must belong to a 
class, the sum of the probabilities must be equal to one. 
Finally, we specify the goal inference task, maximizing the 
conditional probability pr(x I {phi, mu, sigma}) with re- 
spect to the parameters of interest, phi, mu, and sigma. This 
means that we are interested in a maximum likelihood esti- 
mate (MLE) of the model parameters; maximum aposteriori 
estimates (MAP) can be specified by adding priors to the 
model. Note that the model is completely declarative and 
does not require the user to prescibe any algorithmic aspects 
of the estimation program. AutoBayes is thus free to select 
any clustering algorithm that is applicable; however, users 
can force the derivation of specific solutions (e.g. k-means 
instead of EM) via command line parameters. 

Bayesian Networks. A Bayesian network is a directed, 
acyclic graph whose nodes represent random variables and 
whose edges define probabilistic dependencies between the 
random variables. AutoBayes uses hybrid BNs with plates 
[4] to represent the statistical models internally. Hence, 
nodes can represent discrete as well as continuous random 
variables. Plates generalize the concept of independent and 
identically distributed (i.i.d.) random variables and “col- 
lapse” collections of independent, co-indexed random vari- 
ables into graph nodes representing the non-repeated core 
structure; this keeps the graphs compact and the graphical 
reasoning routines (e.g., computing the parents, children, or 
Markov blanket [17] of a node) fast. Distribution and di- 
mension information for the random variables is attached to 
the respective nodes and plates. 

Program Schemas. A schema consists of a parameterized 
code fragment (i.e., template) and a set of constraints. The 
parameters are instantiated by AutoBayes, either directly 
or by calling itself recursively with a modified problem. The 
constraints determine whether a schema is applicable and 
how the parameters can be instantiated. Constraints are for- 
mulated as conditions on the Bayesian network or directly on 
the specified model; they include the maximization goal as 
special case. This allows the network structure to guide the 
application of the schemas and thus to constrain combinato- 
rial explosion of the search space, even if a large number of 
schemas is available. Schemas sire implemented as Prolog- 
clauses and search control is thus simply relegated to the 
Prolog-interpreter: schemas me tried in their textual order. 
This simple approach has not caused problems so far, mainly 
because the domain admits a natural layering which can be 
used to organize the schema library. The top layer comprises 
network decomposition schemas which try to break down the 
network into independent subnets, based on independence 
theorems for Bayesian networks. These are domain-specific 
divide-and-conquer schemas: the emerging subnets are fed 
back into the synthesis process and the resulting programs 
me composed to achieve a program for the original prob- 
lem. AutoBayes is thus able to automatically synthesize 
lmger programs by composition of different schemas. The 
next layer comprises more localized decomposition schemas 
which work on products of i.i.d. vmiables. Their applica- 
tion is also guided by the network structure but they re- 
quire more substantial symbolic computations. The core 



layer of the library contains statistical algorithm schemas as 
for example expectation maximization (EM) [9, 14] and k- 
Means (i.e., nearest neighbor clustering); these generate the 
skeleton of the program. The final layer contains standard 
numeric optimization methods as for example the Nelder- 
Mead simplex method or different conjugate gradient meth- 
ods. These are applied after the statistical problem has been 
transformed into am ordinary numeric optimization problem 
and if AutoBayes failed to find a symbolic solution for 
the problem. Currently, the library contains 28 top-level 
schemas, with a number of additional variations (e.g., dif- 
ferent initializations). 

Symbolic Subsystem. AutoBayes relies significantly on 
symbolic computations to support schema instantiation and 
code optimization. The core part of the symbolic subsys- 
tem implements symbolic-algebraic computations, similar to 
those in Mathematica [19]. It is based on the concept of term 
rewriting [2] and uses a small but reasonably efficient rewrite 
engine. Expression simplification and symbolic differentia- 
tion are implemented as sets of rewrite rules for this rewrite 
engine. The basic rules are straightforward; however, the 
presence of vectors and matrices introduce a few compli- 
cations and require a careful formalization. In addition, 
AutoBayes contains a rewrite system which implements a 
domain-specific refinement of the standard sign abstraction 
where numbers are not only abstracted into pos and neg but 
also into small (i.e., ] x j < 1) and large. AutoBayes then 
uses a relatively simple symbolic equation solver built on top 
of these rewrite systems. This handles only low-order poly- 
nomials (i.e., linear, quadratic, and simple cubic). However, 
it also shifts and normalizes exponents, recognizes multiple 
roots and bi-quadratic forms, and tries to find polynomial 
factors, and handles expressions in x and (1 — x) which are 
common in statistical applications. 

Backend. The code constructed by schema instantiation 
and composition is represented in an imperative interme- 
diate language. This is essentially a “sanitized” subset of 
C (e.g., no pointers), which is extended by a number of 
domain-specific constructs like vector and matrix operations, 
finite sums, and convergence- loops. Since straightforward 
schema application can produce suboptimal code, AutoBayes 
interleaves synthesis and advanced code optimization (cf. 
[16] for an overview). Schemas can explicitly trigger ag- 
gressive large-scale optimizations like code motion, common 
sub-expression elimination, and memoization which can take 
advantage of information from the model and the synthesis 
process. Traditional low-level optimizations like constant 
propagation or loop fusion, however, are left to the compiler. 

In a final step, AutoBayes translates the intermediate code 
into code tailored for a specific run-time environment. Cur- 
rently, AutoBayes includes code generators for the Octave 
and Matlab environments; it can also produce stand-alone 
C and Modula-2 code. 


sible is necessary. Thus far it has been difficult to use only 
broad band color data to accurately map mass on a broad 
range of scales. Astronomers have only been successful in 
doing this on small numbers of spectroscopically measured 
galaxies (of order 10°). If the errors on what we call "Photo- 
metric Redshifts” can be driven sufficiently low enough we 
can, for the first time, use a sample of order 10 s . This two 
orders of magnitude improvment could have very significant 
implications for contemporary theories of the Universe. 

SDSS [?] photometry (five broad band filters/colors ugriz) 
with calculated accurate photometric redshifts is our goal. 
For example, the SDSS will have 10 6 (to date ~ 125, 000 
measured) galaxy redshifts. The next largest survey has ap- 
proximately 220,000. All of the rest of the redshifts surveys 
do not add up to that of the 2dFGRS alone. SDSS photome- 
try will eventually consist of 10® objects (53 x 10 6 currently). 
Again, no survey approaches this quantity of data. Another 
survey is closest with 400 million objects, but only two ’’col- 
ors” are measured, it is spread across entire sky, is a much 
shallower survey and consists mostly of stars within our own 
galaxy, rather than external galaxies as in the SDSS. 

The results from the latest methods used to attack the Pho- 
tometric Redshifts in the SDSS range from root mean squared 
errors from 0.034 - 0.066 and show considerable variability 
due to sampling. To be able to map the filamentary struc- 
tures in the Universe we need a significant improvement in 
the root mean square error of the competing methods. 

Distance Cutoff >1251 Background Gafaxxw - 164 

600 , . 1 r r« . 1 1 ' 1 


5001 



Figure 2: Example clustering of one small section 
of the sky using spectroscopically determined red- 
shifts. Crosses indicate field-galaxies, i.e., those that 
are not on filaments. Dots indicate galaxies that are 
on filaments. 


5. EXPERIMENTS AND RESULTS 
5.1 Sloan Digital Sky Survey 

Mapping the large scale structure of the universe is nec- 
essary in order to better constrain formation scenarios of 

-structures of all-scales (from-galaxies to large walls) in the 

universe. To this end, measuring the "distances” and x-y 
projections on the sky of the largest number of objects pos- 


The next sections describe the preparatory work that we 
performed with the Gaussian mixture models. These runs 
helped us choose the number of clusters and the priors to 
include when we build the kernel matrix using Equation (3). 
The resulting kernel matrix is then supplied to a Support 
Vector Machin e for classification of galaxies to distinguish 
high from low redshift galaxies, and regression, to predict or 
estimate the actual redshift. 



5.2Choosing Parameters for the Mixture Model 

In a first set of experiments, clustering using AutoBayes 
generated code was used. Our model is a multi-variate mix- 
ture of Gaussians. Figure 1 shows the entire AutoBayes 
specification. We ran this model and varied the desired num- 
ber of classes from 3 to 30. Because our EM algorithm uses a 
randomized initialization, 10 independent runs were carried 
out. Figure 3 shows the log-likelihood for the given parame- 
ters after clustering, the solid line shows its mean. From this 
graph it is obvious that the initialization plays an important 
role as it strongly influences the result. From this figure one 
can also deduce that the best number of clusters for the given 
data set is around 12. For larger numbers of clusters, the 
log-likelihood does not change much, indicating that the in- 
creased model complexity does not appreciably increase the 
fit of the model. For each of the clustering runs, EM needed 
between 5 and 55 iterations, with mean of 14.2 iterations to 
converge. 
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Figure 3: Log-likelihood of a Gaussian mixture 

model with no priors as a function of the number 
of components in the model. Each box corresponds 
to one run. Notice that there is substantial varia- 
tion in the terminal value of the likelihood function, 
which is due to the well-known sensitivity of the EM 
algorithm to initial conditions. 

The redshift of a galaxy has a strong connection on how its 
spectral features are mapped onto the 5 spectral bands given 
by the symbols ( u,g,r,i , and z). If, for example, a signifi- 
cant spectral feature of a near galaxy shows up in band r, the 
corresponding feature of a similar, but distant galaxy would 
be shifted toward the next band, i. Thus, we can assume 
that the data points in the different bands are not uncor- 
related (as in the previous model), but that they have cor- 
relations with the neighboring band. This extended model, 
which has a band covariance matrix, includes a simple trans- 
formation of the data: the new clustering algorithm gets the 
original 5 bands, but also the difference signal between adja- 
cent bands: u — g, g — r, r — i,i — z. Figure shows the results 
of this clustering in terms of the likelihood function. The 
likelihood function shows similar variation, and when penal- 
ized for the additional model complexity, also indicates that 
the correct-number of clusters is around 12. 

In the next experiment, a subset of our training data set was 



Figure 4: Log-likelihood over number of components 
(shifted case) 


used. It contained all data points, for which the measured 
red-shift was larger that 0.3. This data set contains 4530 
of the 52744 data points. A similar clustering experiment 
(with the above AutoBayes model) revealed that the best 
number of clusters for distant galaxies is much lower (around 
5). Figure 5 illustrates this. 
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Figure 5: Log-likelihood over number of compo- 

nents; distant galaxies only 


5.3 Incorporating Prior Knowledge 

In order to incorporate prior information in the AutoBayes 
model, we specified conjugate priors on the mean values p. 
The only changes of the specification are the declarations 
of the prior parameters po and «o, their relationship to the 
mean, and a new optimization goal: 

double mu 0(0. .D-l. 0. -C-l) . 
double kappa_0(0. .D-l, 0. .C-l) . 
mu(_i,_j) ' gauss (nm_0(_i, _j) , 

sqrt(sigma(-i,_j))*kappaJ3(_j.,_il) 

max pr({mu, x} 1 { sigma, phi }) vrt (phi, mu, sigma}. 


The rest of the specification remains unchanged. Auto Bayes 
automatically instantiates the appropriate EM algorithm 
with a highly complex log-likelihood function given in Sec- 
tion 2. A visual investigation of the data displayed in Fig- 
ure 6 indicated that cluster centers need to be placed in 
spectral regions which w'ould model high redshift galaxies. 
We placed 5 clusters, based on the results from the clus- 
tering of distant galaxies only, at the spectroscopic inputs 
corresponding to those galaxies. Those clusters had pri- 
ors associated with them on the r spectral band, since that 
has maximum correlation with the redshift. The remaining 
input dimensions had no priors. Furthermore, we did not 
place priors on the mixture weights or the covariance matri- 
ces. Non-isotropic diagonal covariance matrices were used in 
this study. With this model, we trained 20 mixture models 
to build the Mixture Density Mercer Kernel. 
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Figure 6: This figure shows a multivariate scatter 
plot between the bands ( u,g,r,i,z ) and the redshift. 
Galaxies which are farther away, i.e., those with 
higher redshift have lower spectral energy in the 
band, as expected. Nearby galaxies have high spec- 
tral content. This information was used to set priors 
in subsequent models. 


5.4 Evaluation of Results 

The Mixture Density Mercer Kernel was built using proba- 
bilistic models that included priors as well as those without 
priors on a training set of 1500 galaxies and a test set of 5000 
galaxies. We first submitted the MDMK along with the 5000 
test galaxies to a single CART decision tree module avail- 
able in Matlab. The resulting confusion matrix indicated 
that only 77% of the distant galaxies (those with a redshift 
greater that 0.3) were classified correctly Thus, the model 
had a true positive rate of 77%. Using the Mixture Density 
Mercer Kernel, this rate was dramatically improved to ap- 
proximately 93% using the same training and test data. The 
tree and the MDMK classified approximately 99% and 97% 
of the nearby galaxies correctly. This however, is an easy 
problem since nearby galaxies may have high spectral energy 
content, whereas distant galaxies never have high spectral 
content. It is much more difficult to distinguish far galaxies 
from those that are dim and near. 


The Mixture Density Mercer Kernel performed significantly 
better than the benchmark classifier that we used regardless 
of the use of prior information. It turned out that in this 
application, prior information only improved the results of 
the false negative rate by about 1%, which is within the 
variation due to the model uncertainty. Subsequent research 
into the specific location of the priors and the shape of the 
covariance matrices will be performed. 

In order to further test the quality of the SVM based on 
Mixture Density Mercer Kernels, we built an ensemble of 20 
bagged trees that were built on bootstrap replicates drawn 
from the training set. For this scenario, we found that the 
true positive rate increased from 77% for the single tree to 
86%. The true negative rate remained the same. However 
the true positive rate, although appreciably higher, was still 
lower than the result for the Mixture Density Mercer Kernel 
SVM. 

The next experiment that we performed increased the train- 
ing population for the bagged trees from the original sce- 
nario, where we were drawing bootstrap samples from 1500 
points to 45,000 points, representing a 30 fold increase in the 
amount of training data. This dramatic increase in training 
data helped the bagged trees true positive rate, bringing it 
to approximately 91%, still 2% lower than the result for the 
MDMK SVM, which was built on a data set SO times smaller 
in size. Note that for all experiments described here, we re- 
port the best results out of several runs for all models. 

The significant increase in classification accuracy can be at- 
tributed to the structure induced in the kernel matrix by the 
mixture modelling process. We computed a kernel matrix 
using the procedure outlined in this paper, and evaluated the 
matrix entries using a data set that was sorted in increas- 
ing order of redshift. The resulting matrix clearly shows 
that high redshift galaxies are generally not confused with 
lower redshift galaxies by the model. There are two notable 
exceptions in the matrix. Confusion would be indicated by 
large off diagonal elements in the matrix. Note that we have 
displayed the kernel matrix in sorted order only for illustra- 
tive purposes. The support vector machine’s classification 
accuracy is independent of the order in which the data is 
presented; the underlying mathematics is invariant subject 
to the permutation of the data and the corresponding kernel 
values. 

We ran the Mixture Density Mercer Kernels in a support 
vector regression machine in order to directly estimate red- 
shift. For this problem, regardless of the use of priors, we 
were able to obtain a root mean squared error of approxi- 
mately 0.057 on test data, which is comparable to the er- 
ror rates of published methods on large samples. This data 
shows a great deal of sample-to-sample variability. A CART 
regression tree realized a root mean squared error of approx- 
imately 0.045, which is about 10% better than the MDMK 
described here. These results, however, are not surprising 
since the MDMK is built to have high performamce on clas- 
sification tasks. 


6. CONCLUSIONS 

Our results indicate that for a difficult, real-world classifi- 
cation task, the Mixture Density Mercer Kernel (MDMK) 
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Figure 7: This figure illustrates the reason that the 
Mixture Density Mercer Kernel performs so well on 
the classification task of identifying nearby galaxies 
from those that are far away. Galaxies that are far 
away are bunched together in the lower right hand 
corner of the matrix. 


performs better 16% better than a decision tree. We have 

developed a method to incorporate prior knowledge into 
the model which is a novel approach to learning kernels di- 
rectly from data. The MDMK with priors was built with 
AutoBayes, which automatically generates code to model 
mixture densities with prior information. The AutoBayes 
system generates code to model the mixture density based 
on high-level specifications, automatically instantiates the 
associated EM algorithm schema, performs all necessary op- 
timizations, and generates the symbolic solution along with 
the likelihood function. 

We plan to further investigate the use of prior information 
in the Mixture Density Mercer Kernel framework on other 
real world and synthetic problems. The dramatic increase 
in classification accuracy that is exhibited here is most likely 
due to the way the kernel function is constructed. The use 
of prior information may prove to be very useful as new 
understandings about the data generating process and the 
associate physics arise. 

The results also indicate that the Mixture Density Mercer 
Kernel can be am excellent representation for classification 
problems using very small samples of data. In resource con- 
strained environments, where CPU, RAM, or other com- 
putational power is constrained, this kernel may have util- 
ity. We plan to explore this avenue further to see how the 
MDMK behaves under constrained conditions. We also plan 
to generalize the MDMK to multiclass problems. 
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